According to Eq. 4.6, the variance in the anti-bandit algorithm is bounded, and the
lower/upper confidence bounds can be estimated as
$$\tilde{r}_k - \sqrt{\frac{2\ln N}{n}} \;\leq\; r_k \;\leq\; \tilde{r}_k + \sqrt{\frac{2\ln N}{n}}. \qquad (4.8)$$
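To make Eq. 4.8 concrete, the following Python sketch computes both confidence bounds for a set of candidate operations. It is a minimal illustration, assuming NumPy arrays of running mean rewards and pull counts; the variable names and example values are assumptions, not part of the algorithm specification.

import numpy as np

def confidence_bounds(r_tilde, n, N):
    # Eq. (4.8): lower/upper confidence bounds for the estimated mean
    # reward r_tilde of an arm pulled n times out of N total pulls.
    half_width = np.sqrt(2.0 * np.log(N) / n)
    return r_tilde - half_width, r_tilde + half_width

# Illustrative usage with K = 3 candidate operations (values assumed).
r_tilde = np.array([0.62, 0.55, 0.70])  # running mean rewards
n = np.array([10.0, 25.0, 5.0])         # pulls per operation
lcb, ucb = confidence_bounds(r_tilde, n, N=n.sum())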
4.2.2 Search Space
Following [307, 151, 291], we search for computation cells as the building blocks of the final architecture. A cell is a fully connected directed acyclic graph (DAG) of M nodes, i.e., $\{B_1, B_2, ..., B_M\}$, as shown in Fig. 4.13. Here, each node is a specific tensor (e.g., a feature map in a convolutional neural network), and each directed edge $(i, j)$ between $B_i$ and $B_j$ denotes an operation $o^{(i,j)}(\cdot)$, which is sampled from $\Omega^{(i,j)} = \{o^{(i,j)}_1, ..., o^{(i,j)}_K\}$. $\{\Omega^{(i,j)}\}$ is the search space of a cell. Each node $B_j$ takes its dependent nodes as input and is obtained by $B_j = \sum_{i<j} o^{(i,j)}(B_i)$; the constraint $i < j$ avoids cycles in a cell. Each cell takes the output of the previous cell as input, and for brevity we denote by $B_0$ both the last node of the previous cell and the first node of the current cell. Unlike existing approaches that use only normal and reduction cells, we search for v (v > 2) cells instead. A minimal sketch of this cell computation is given below.
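The following PyTorch sketch implements the node computation $B_j = \sum_{i<j} o^{(i,j)}(B_i)$ in a cell's forward pass. It assumes each sampled operation is an nn.Module; the Cell class and its interface are illustrative, not the implementation used in this chapter.

import torch.nn as nn

class Cell(nn.Module):
    # Sketch of the DAG cell: node B_j sums the sampled operations
    # applied to all earlier nodes, B_j = sum_{i<j} o^(i,j)(B_i).
    def __init__(self, ops, num_nodes):
        # ops: dict mapping an edge (i, j) to the operation sampled
        # from Omega^(i,j) (assumed to be an nn.Module).
        super().__init__()
        self.ops = nn.ModuleDict({f"{i}_{j}": op for (i, j), op in ops.items()})
        self.num_nodes = num_nodes

    def forward(self, b0):
        nodes = [b0]  # B_0: the last node of the previous cell
        for j in range(1, self.num_nodes + 1):
            nodes.append(sum(self.ops[f"{i}_{j}"](nodes[i]) for i in range(j)))
        return nodes[-1]  # output of the cell's last node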
For general NAS search, we follow [151] and take seven normal operations: 3×3 max pooling, 3×3 average pooling, skip connection (identity), 3×3 dilated convolution with rate 2, 5×5 dilated convolution with rate 2, 3×3 depth-wise separable convolution, and 5×5 depth-wise separable convolution. Considering adversarially robust optimization for NAS, we introduce two additional operations, the 3×3 Gabor filter and the denoising block, for model defense. The size of the entire search space is therefore $K^{|E_M| \times v}$, where $E_M$ is the set of possible edges in the fully connected DAG with M intermediate nodes. In the case with M = 4 and v = 6, together with the input node, the total number of cell structures in the search space is $9^{(1+2+3+4) \times 6} = 9^{10 \times 6}$. Here, we briefly introduce the two additional operations.
Gabor filter. Gabor filters [69, 68], which contain frequency and orientation representations, can characterize the spatial frequency structure of images while preserving spatial relationships. This operation provides strong robustness for the network [191]. Gabor filters are defined as
$$\exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\!\left(2\pi \frac{x'}{\lambda} + \psi\right),$$
where $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$. Here, $\sigma$, $\gamma$, $\lambda$, $\psi$, and $\theta$ are learnable parameters. Note that these symbols apply only to the Gabor filter and differ from those used in the rest of this chapter. Figure 4.2(b) shows an example of Gabor filters.
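As a concrete illustration, the NumPy sketch below builds a fixed-size Gabor kernel from this formula. In the searched operation the five parameters are learnable; here they are passed as plain numbers, and the function name and example values are assumptions.

import numpy as np

def gabor_kernel(size, sigma, gamma, lam, psi, theta):
    # Evaluate the Gabor formula on a size x size grid centered at 0;
    # sigma, gamma, lam, psi, theta correspond to σ, γ, λ, ψ, θ.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot**2 + gamma**2 * y_rot**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_rot / lam + psi)
    return envelope * carrier

# Example: a 3x3 kernel oriented at 45 degrees (values assumed).
kernel = gabor_kernel(size=3, sigma=1.0, gamma=0.5, lam=2.0, psi=0.0, theta=np.pi / 4)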
Denoising block. As described in [253], adversarial perturbations on images introduce noise into the features. Denoising blocks can therefore improve adversarial robustness by denoising the features. Following this, we add the non-local means denoising block [22], shown in Fig. 4.2(c), to the search space. It computes a denoised feature map z from an input feature map x by taking a weighted mean over all spatial locations $q \in L$ of the features:
$$z_p = \frac{1}{C(x)} \sum_{\forall q \in L} f(x_p, x_q) \cdot x_q,$$
where $f(x_p, x_q)$ is a feature-dependent weighting function and $C(x)$ is a normalization function.
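A minimal PyTorch sketch of such a block follows. It assumes the dot-product weighting $f(x_p, x_q) = x_p \cdot x_q$ with a softmax realizing the normalization $C(x)$, plus a residual 1×1 convolution as in typical non-local block designs; these specific choices are assumptions, not necessarily the exact configuration used here.

import torch
import torch.nn as nn

class NonLocalDenoise(nn.Module):
    # Non-local means denoising: z_p = (1 / C(x)) * sum_q f(x_p, x_q) x_q,
    # with f the dot product and C(x) realized by a softmax over q.
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)                          # locations q in L
        weights = torch.softmax(
            torch.bmm(flat.transpose(1, 2), flat), dim=-1)  # f(x_p, x_q) / C(x)
        z = torch.bmm(flat, weights.transpose(1, 2)).view(b, c, h, w)
        return x + self.proj(z)                             # residual connection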
4.2.3 Anti-Bandit Strategy for NAS
As described in [274, 291], the validation accuracy ranking of different network architectures is not a reliable indicator of their final quality. However, experimental results suggest that if an architecture performs poorly at the beginning of training, there is little hope that it can be part of the final optimal model [291]. As training progresses, this observation becomes increasingly reliable. On the basis of this observation, we derive a simple yet effective training strategy. During training, as the epochs increase,